Code
library(tidyverse)
library(here)
# library(lubridate)
library(tsibble) # new lib!
library(feasts)
library(fable)library(tidyverse)
library(here)
# library(lubridate)
library(tsibble) # new lib!
library(feasts)
library(fable)Toolik Station (LTER) meteorological data (Source: Source: Shaver, G. 2019. A multi-year DAILY file for the Toolik Field Station at Toolik Lake, AK starting 1988 to present. ver 4. Environmental Data Initiative.) See the Toolik Field Station Environmental Data Center site for more details.
toolik_df <- read_csv(here("data", "toolik_daily.csv"))Go ahead and try plotting the data as imported.
ggplot(data = toolik_df, aes(x = date, y = daily_air_temp)) +
geom_line()
### Booo we get a warning (only one observation per series)Notice that it doesn’t work - because R doesn’t understand the date is a date until we tell it.
Let’s go ahead and convert it to a tsibble using the as_tsibble() function. First, we’ll need to convert the date to a date class, then convert to a tsibble. We could just keep it as a regular dataframe with a date column and do a lot with that, but the tsibble package gives lots of functionality around time series (thus the ts at the start).
toolik_ts <- toolik_df %>%
mutate(date = lubridate::mdy(date)) %>%
as_tsibble(key = NULL, ### if we had multiple obs on same date from diff locations
index = date) ### time index, here our column is `date`Now let’s plot it:
ggplot(data = toolik_ts, aes(x = date, y = daily_air_temp)) +
geom_line() +
labs(x = "Date",
y = "Mean daily air temperature (Celsius)\n at Toolik Station")We need to ask some big picture questions at this point, like:
filter_index() to filter by date-times!We can use filter_index() specifically to help us filter data by time spans. See ?filter_index() for more information.
Formulas that specify start and end periods (inclusive), or strings.
~ end or . ~ end: from the very beginning to a specified ending period.
start ~ end: from specified beginning to ending periods.
start ~ .: from a specified beginning to the very end of the data. Supported index type: POSIXct (to seconds), Date, yearweek, yearmonth/yearmon, yearquarter/yearqtr, hms/difftime & numeric.
So for example “2010-12” ~ “2011-01” would filter from December 2010 through January 2011.
toolik_ts %>%
filter_index("2010-12","2011-01") %>%
tail()# A tsibble: 6 x 5 [1D]
date daily_air_temp daily_precip mean_barom mean_windspeed
<date> <dbl> <dbl> <dbl> <dbl>
1 2011-01-26 -22.5 NA 911. 1.19
2 2011-01-27 -18.5 NA 926. 2.17
3 2011-01-28 -12.0 NA 933. 3.36
4 2011-01-29 -4.85 NA 928. 7.33
5 2011-01-30 -6.68 NA 925. 4.90
6 2011-01-31 -6.71 NA 918. 5.93
Try to filter the Toolik tsibble from April 10, 2006 to May 15, 2006.
toolik_ts %>%
filter_index("2006-04","2006-05-15")# A tsibble: 31 x 5 [1D]
date daily_air_temp daily_precip mean_barom mean_windspeed
<date> <dbl> <dbl> <dbl> <dbl>
1 2006-04-01 -7.33 NA NA NA
2 2006-04-02 -9.52 NA NA NA
3 2006-04-03 -19.0 NA NA NA
4 2006-04-04 -26.0 NA NA NA
5 2006-04-05 -27.6 NA NA NA
6 2006-04-06 -27.5 NA NA NA
7 2006-04-07 -21.6 NA NA NA
8 2006-04-08 -15.4 NA NA NA
9 2006-04-09 -15.8 NA NA NA
10 2006-04-10 -18.6 NA NA NA
# ℹ 21 more rows
Filter another index from December 20, 2020 to the end of the dataset.
toolik_ts %>%
filter_index("2020-12-20"~.)# A tsibble: 742 x 5 [1D]
date daily_air_temp daily_precip mean_barom mean_windspeed
<date> <dbl> <dbl> <dbl> <dbl>
1 2020-12-20 -39.0 0 909. 0.652
2 2020-12-21 -26.2 0.2 912. 2.25
3 2020-12-22 -7.97 1.5 908. 6.33
4 2020-12-23 -6.79 2.2 901. 2.49
5 2020-12-24 -7.44 0.7 913. 4.51
6 2020-12-25 -8.26 0 918. 3.01
7 2020-12-26 -21.4 0.1 917. 1.59
8 2020-12-27 -20.4 0 916. 1.11
9 2020-12-28 -15.0 0.3 916. 1.20
10 2020-12-29 -16.0 0.1 920. 1.22
# ℹ 732 more rows
filter_index() also accepts multiple periods using commas to separate the intervals. Try to filter from December 2010 through January 2011, and from September 2, 2013 to October 31, 2014.
Don’t worry about saving the output to another object.
toolik_ts %>%
filter_index("2010-12"~"2011-01", "2013-09-02"~"2014-10-31")# A tsibble: 487 x 5 [1D]
date daily_air_temp daily_precip mean_barom mean_windspeed
<date> <dbl> <dbl> <dbl> <dbl>
1 2010-12-01 -21.8 0.5 928. 4.23
2 2010-12-02 -23.4 0.3 927. 2.36
3 2010-12-03 -17.9 0.2 921. 2.63
4 2010-12-04 -24.0 1.7 919. 3.70
5 2010-12-05 -31.7 0 936. 2.39
6 2010-12-06 -23.5 0 940. 3.07
7 2010-12-07 -21.3 0 936. 3.50
8 2010-12-08 -22.1 0 932. 3.77
9 2010-12-09 -18.7 2.2 930. 3.35
10 2010-12-10 -26.0 0.4 929. 3.37
# ℹ 477 more rows
index_by() to aggregate time series by incrementsindex_by() replaces group_by() for time series data. After the data is indexed, we can still use summary, but now we’re doing it across dates.
toolik_month <- toolik_ts |>
index_by(yr_mo = ~yearmonth(.)) |>
summarize(monthly_mean_temp = mean(daily_air_temp, na.rm = TRUE)) |>
ungroup() ### just like after group_by()Now let’s take a look:
toolik_month %>%
ggplot(aes(x = year(yr_mo), y = monthly_mean_temp)) +
geom_line() +
facet_wrap(~month(yr_mo, label = TRUE)) +
labs(x = "Year",
y = "Annual mean air temperature (Celsius)",
title = "Toolik Station mean annual air temperature",
subtitle = "1988 - 2018")There are some other examples of index_by() in the fable package documentation. Look at the key for more examples.
To reinforce skills for wrangling, visualizing, and forecasting with time series data, we will use data on US residential energy consumption from January 1973 - September 2023 (from the US Energy Information Administration).
Read in the energy.csv data (use here(), since it’s in the data subfolder).
energy_df <- read_csv(here("data", "energy.csv"))Examine patterns and trends in residential energy consumption over time
Predict where residential energy use patterns will be five years from now
Explore the energy_df object as it currently exists. Notice that there is a column yyyymm that contains the 4-digit year and 2-digit month. Currently, however, R understands that as a character (instead of as a date). Our next step is to convert it into a time series data frame (a tsibble), in two steps:
energy_ts <- energy_df %>%
mutate(date= yearmonth(yrmonth)) %>%
as_tsibble(index=date,
key=sector) # group by sector if same dateLet’s take a quick look at our tsibble (for residential energy use, in trillion BTU):
ggplot(data = energy_ts, aes(x = date, y = energy_total, color = sector)) +
geom_line() +
labs(y = "Energy consumption by sector \n (Trillion BTU)")Looks like there are some interesting things happening. Focus on residential:
A seasonplot can help point out seasonal patterns, and help to glean insights over the years. We’ll use feasts::gg_season() to create an exploratory seasonplot, which has month on the x-axis, energy consumption on the y-axis, and each year is its own series (mapped by line color).
energy_ts %>%
filter(sector == 'residential') %>%
gg_season(y = energy_total, pal = hcl.colors(n = 9)) +
theme_minimal() +
labs(x = "month",
y = "residential energy consumption (trillion BTU)")This is really useful for us to explore both seasonal patterns, and how those seasonal patterns have changed over the years of this data (1973 - 2023). What are the major takeaways from this seasonplot?
Let’s explore the data a couple more ways:
energy_ts %>% gg_subseries(energy_total)Our takeaway here is similar: there is clear seasonality (higher values in winter months), with an increasingly evident second peak in June/July/August. This reinforces our takeaways from the raw data and seasonplots.
See Rob Hyndman’s section on STL decomposition to learn how it compares to classical decomposition we saw in lecture: “STL is a versatile and robust method for decomposing time series. STL is an acronym for “Seasonal and Trend decomposition using LOESS”, while LOESS is a method for estimating nonlinear relationships.” LOESS (“Locally estimated scatterplot smoothing”) uses a weighted moving average across all points in the dataset, weighted by distance from the point being averaged.
Notice that it allows seasonality to vary over time (a major difference from classical decomposition, and important here since we do see changes in seasonality).
# Find STL decomposition
dcmp <- energy_ts %>%
filter(sector == 'residential') %>%
model(feasts::STL(energy_total ~ season(period = '1 year') + trend(window = 25)))
# Visualize the decomposed components
components(dcmp) %>%
autoplot() +
theme_minimal()We use the ACF to explore autocorrelation (here, we would expect seasonality to be clear from the ACF):
energy_ts %>%
filter(sector == 'residential') %>%
ACF(energy_total) %>%
autoplot()And yep, we see that observations separated by 12 months are the most highly correlated, reflecting strong seasonality we see in all of our other exploratory visualizations.
Note: here we use ETS, which technically uses different optimization than Holt-Winters exponential smoothing, but is otherwise the same (From Rob Hyndman: “The model is equivalent to the one you are fitting with HoltWinters(), although the parameter estimation in ETS() uses MLE.”)
To create the model below, we specify the model type (exponential smoothing, ETS), then tell it what type of seasonality it should assume using the season("") expression, where “N” = non-seasonal (try changing it to this to see how unimpressive the forecast becomes!), “A” = additive, “M” = multiplicative. Here, we’ll say seasonality is multiplicative due to the change in variance over time and also within the secondary summer peak, and trend is additive:
# Create the model:
energy_fit <- energy_ts %>%
# filter_index('2000-01' ~ .) %>%
### try different date windows since trend seems to change
filter(sector == 'residential') %>%
group_by_key(sector) %>%
model(
ets = ETS(energy_total ~ season(method = "M") + trend(method = "A"))
)
# Forecast using the model 5 years into the future:
energy_forecast <- energy_fit %>%
forecast(h = "5 years")
# Plot just the forecasted values (with 80 & 95% CIs):
energy_forecast %>%
autoplot()# Or plot it added to the original data:
energy_forecast %>%
autoplot(energy_ts)We can use broom::augment() to append our original tsibble with what the model predicts the energy usage would be based on the model. Let’s do a little exploring through visualization.
First, use broom::augment() to get the predicted values & residuals:
# Append the predicted values (and residuals) to original energy data
energy_predicted <- broom::augment(energy_fit)
# Use View(energy_predicted) to see the resulting data frameNow, plot the actual energy values (energy_total), and the predicted values (stored as .fitted) atop them:
ggplot(data = energy_predicted) +
geom_line(aes(x = date, y = energy_total)) +
geom_line(aes(x = date, y = .fitted), color = "red", alpha = .7)Now let’s explore the residuals. Remember, some important considerations: Residuals should be uncorrelated, centered at 0, and ideally normally distributed. One way we can check the distribution is with a histogram:
ggplot(data = energy_predicted, aes(x = .resid)) +
geom_histogram()